```python
from config import *
%matplotlib inline

cmap = plt.get_cmap('viridis')
```
# User Cohorts Analysis
Understanding user behavior over time is crucial for any business, especially in fast-growing startups where data-driven decisions can define success. One of the most effective ways to analyze retention, engagement, and growth is through cohort analysis.
A user cohort table segments users based on shared characteristics—such as sign-up month or first purchase—and tracks their activity across time. This approach helps businesses measure customer retention, assess the impact of product changes, and optimize marketing strategies.
In this notebook, we’ll walk through the process of analyzing a cohort table step by step. Using Python, we’ll transform raw data into actionable insights, visualizing retention curves and uncovering trends that can inform strategic decisions. Whether you’re working in SaaS, e-commerce, or mobile apps, this guide will equip you with the tools to extract meaningful patterns from user data.
Let’s dive in.
## Right-aligned table
In the following table:

* every row corresponds to a single cohort of users, in this case those who made their first purchase in the given month
* every column indicates the number of users still active in the given month
For example, the number of users who made their first purchase in March and were still engaging with the platform in June is 9,468.
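The analysis below starts from a precomputed `user_cohorts.csv`. For context, a cohort table of this shape could be derived from a raw transaction log along these lines (a minimal sketch; the file name and the `user_id`/`date` columns are assumptions for illustration, not the actual preprocessing):

```python
import pandas as pd

# Hypothetical raw data: one row per purchase event
# ('transactions.csv' and its columns are assumed for illustration)
tx = pd.read_csv('transactions.csv', parse_dates=['date'])
tx['month'] = tx['date'].dt.to_period('M')

# A user's cohort is the month of their first purchase
tx['cohort'] = tx.groupby('user_id')['month'].transform('min')

# Count distinct active users per (cohort, calendar month)
cohort_table = (
    tx.groupby(['cohort', 'month'])['user_id']
      .nunique()
      .unstack('month')
)
```

This yields one row per cohort and one column per calendar month, i.e. the right-aligned layout used below.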
```python
# Load the cohort table and index it by cohort month
nc_cohorts_right = pd.read_csv('user_cohorts.csv')
nc_cohorts_right.rename(columns={'Num Users': 'month'}, inplace=True)
ind = nc_cohorts_right['month']  # keep the cohort labels for later reuse
nc_cohorts_right.set_index('month', inplace=True)

nc_cohorts_right_filled = nc_cohorts_right.fillna(0)
```
```python
styled_df = nc_cohorts_right.style.background_gradient(cmap=cmap, axis=1, vmax=80000, vmin=1000).format('{:,.0f}')

# Hide NaN cells by rendering them white on white
styled_df = styled_df.apply(
    lambda row: ['background-color: white; color: white' if pd.isna(v) else '' for v in row]
)
styled_df
```
| month | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Jan | 689 | 414 | 379 | 342 | 347 | 362 | 344 | 349 | 347 | 339 | 342 | 337 |
| Feb | | 5,595 | 3,039 | 2,633 | 2,565 | 2,408 | 2,366 | 2,318 | 2,109 | 2,166 | 2,051 | 2,026 |
| Mar | | | 20,041 | 11,037 | 9,872 | 9,468 | 8,807 | 8,682 | 8,025 | 7,788 | 7,504 | 7,057 |
| Apr | | | | 30,278 | 15,831 | 14,449 | 13,351 | 12,587 | 11,856 | 11,180 | 10,988 | 9,999 |
| May | | | | | 35,630 | 16,924 | 14,768 | 13,840 | 12,966 | 12,260 | 12,118 | 11,082 |
| Jun | | | | | | 47,489 | 21,653 | 18,921 | 17,141 | 15,806 | 15,654 | 14,102 |
| Jul | | | | | | | 78,425 | 34,774 | 29,399 | 26,073 | 25,229 | 23,016 |
| Aug | | | | | | | | 109,117 | 43,172 | 35,336 | 33,469 | 30,535 |
| Sep | | | | | | | | | 116,187 | 40,402 | 35,154 | 31,603 |
| Oct | | | | | | | | | | 117,185 | 34,462 | 29,055 |
| Nov | | | | | | | | | | | 107,597 | 32,920 |
| Dec | | | | | | | | | | | | 97,164 |
```python
retained_clients = nc_cohorts_right.Dec.sum()
print(f"Total retained clients: {int(retained_clients)}")
```

```
Total retained clients: 288895
```
We can visualize the table more intuitively with a simple chart.
```python
colors = cmap(np.linspace(0, 1, nc_cohorts_right.shape[1]))

fig, ax = plt.subplots(figsize=(8, 5))
ax.stackplot(nc_cohorts_right_filled.index, nc_cohorts_right_filled, labels=nc_cohorts_right_filled.columns, colors=colors, alpha=0.8)
ax.legend(loc='upper left', title='Cohort')
ax.set_title('Users Per Month', loc='left')
ax.set_xlabel('Month')
ax.set_ylabel('Number of Users')
ax.grid(False)
plt.show()
```
In this case, I noticed that growth is slowing down. Remember, you always want to confirm your visual intuition with data: here, I approximated the growth curve with a known (continuous) function. Since the function approximates the curve well, I can use its first derivative to draw conclusions about the instantaneous growth rate.
```python
trend = nc_cohorts_right.sum()  # total active users per calendar month
t = np.arange(1, len(trend) + 1)

def logistic_function(t, L, k, t0):
    return L / (1 + np.exp(-k * (t - t0)))

[L, k, t0], _ = curve_fit(logistic_function, t, trend, p0=[trend.max(), 1, 0])

approx = logistic_function(t, L, k, t0)
intensity = k * L * np.exp(-k * (t - t0)) / (1 + np.exp(-k * (t - t0)))**2  # analytic first derivative

R2 = 1 - np.sum((trend - approx) ** 2) / np.sum((trend - np.mean(trend)) ** 2)
```
```python
fig, ax = plt.subplots(2, 1, figsize=(8, 6))
ax[0].plot(trend, label="Trend")
ax[0].plot(approx, label="Logistic Fit")
ax[0].set_title(f'Approximation of Growth with Logistic Function: R² = {round(R2, 3)}')
ax[0].legend()

ax[1].plot(intensity, color="red", label="Derivative")
ax[1].set_title("Derivative of Logistic Function (Growth Intensity)")
ax[1].legend()

plt.tight_layout()
```
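As a quick sanity check (my addition, not part of the original analysis), the closed-form derivative can be compared against a finite-difference derivative of the fitted curve; the two should agree up to discretization error on the monthly grid:

```python
# Numerical derivative of the fitted logistic curve; it should closely
# track the analytic `intensity` computed above
numeric_intensity = np.gradient(logistic_function(t, L, k, t0), t)
```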
## Left-aligned table
In the following table:

* every row corresponds to a single cohort of users, in this case those who made their first purchase in the given month
* every column indicates the number of users still active after n months, where n is the column header; n=0 corresponds to the month in which the users made their first purchase
This table is very useful, as it allows us to analyze the behaviour of different cohorts over their lifetime.
```python
# Shift each cohort row left so that column 0 is the first-purchase month
nc = nc_cohorts_right.copy().to_numpy()
for i, row in enumerate(nc):
    a = np.argwhere(np.isnan(row))
    if a.size != 0:
        start = a.max() + 1  # index of the first non-NaN entry
        temp = row[start:]
        if temp.size < 12:
            # pad the tail with NaN so every row keeps 12 columns
            temp = np.append(temp, np.full(shape=(12 - temp.size), fill_value=np.nan))
        nc[i] = temp
    else:
        pass

nc_cohorts_left = pd.DataFrame(nc, index=ind)
```
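As an aside, the same left shift can be expressed with pandas alone; an equivalent sketch (not the author's code):

```python
# Drop each row's leading NaNs and re-number the surviving months 0..k;
# rows shorter than 12 months are padded back with trailing NaNs automatically
nc_cohorts_left_alt = nc_cohorts_right.apply(
    lambda row: pd.Series(row.dropna().to_numpy()), axis=1
)
```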
```python
styled_df = nc_cohorts_left.style.background_gradient(cmap=cmap, axis=1, vmax=80000, vmin=1000).format('{:,.0f}')

styled_df = styled_df.apply(
    lambda row: ['background-color: white; color: white' if pd.isna(v) else '' for v in row]
)
styled_df
```
| month | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Jan | 689 | 414 | 379 | 342 | 347 | 362 | 344 | 349 | 347 | 339 | 342 | 337 |
| Feb | 5,595 | 3,039 | 2,633 | 2,565 | 2,408 | 2,366 | 2,318 | 2,109 | 2,166 | 2,051 | 2,026 | |
| Mar | 20,041 | 11,037 | 9,872 | 9,468 | 8,807 | 8,682 | 8,025 | 7,788 | 7,504 | 7,057 | | |
| Apr | 30,278 | 15,831 | 14,449 | 13,351 | 12,587 | 11,856 | 11,180 | 10,988 | 9,999 | | | |
| May | 35,630 | 16,924 | 14,768 | 13,840 | 12,966 | 12,260 | 12,118 | 11,082 | | | | |
| Jun | 47,489 | 21,653 | 18,921 | 17,141 | 15,806 | 15,654 | 14,102 | | | | | |
| Jul | 78,425 | 34,774 | 29,399 | 26,073 | 25,229 | 23,016 | | | | | | |
| Aug | 109,117 | 43,172 | 35,336 | 33,469 | 30,535 | | | | | | | |
| Sep | 116,187 | 40,402 | 35,154 | 31,603 | | | | | | | | |
| Oct | 117,185 | 34,462 | 29,055 | | | | | | | | | |
| Nov | 107,597 | 32,920 | | | | | | | | | | |
| Dec | 97,164 | | | | | | | | | | | |
print(f"Total initial users: {int(nc_cohorts_left[0].sum())}")
Total initial users: 765395
### Assessing retention rates
With a left-aligned table, we can analyse retention rates for every cohort. In this case, retention rates are calculated relative to the cohort's initial size: \[ \frac{Active\, users\, at\, month\, n}{Active\, users\, at\, month\, 0} \]
```python
retention = nc_cohorts_left.copy()

for col in retention.columns:
    if col == 0:
        retention[col] = 1
    else:
        retention[col] = nc_cohorts_left[col]/nc_cohorts_left[0]

styled_df = retention.style.background_gradient(cmap=cmap, axis=1, vmax=.60, vmin=.30).format('{:.2%}')

styled_df = styled_df.apply(
    lambda row: ['background-color: white; color: white' if pd.isna(v) else '' for v in row]
)
styled_df
```
| month | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Jan | 100.00% | 60.14% | 55.07% | 49.64% | 50.36% | 52.54% | 50.00% | 50.72% | 50.36% | 49.28% | 49.64% | 48.91% |
| Feb | 100.00% | 54.33% | 47.06% | 45.85% | 43.04% | 42.28% | 41.44% | 37.69% | 38.72% | 36.66% | 36.22% | |
| Mar | 100.00% | 55.07% | 49.26% | 47.24% | 43.94% | 43.32% | 40.04% | 38.86% | 37.44% | 35.21% | | |
| Apr | 100.00% | 52.29% | 47.72% | 44.09% | 41.57% | 39.16% | 36.92% | 36.29% | 33.03% | | | |
| May | 100.00% | 47.50% | 41.45% | 38.84% | 36.39% | 34.41% | 34.01% | 31.10% | | | | |
| Jun | 100.00% | 45.60% | 39.84% | 36.10% | 33.28% | 32.96% | 29.70% | | | | | |
| Jul | 100.00% | 44.34% | 37.49% | 33.25% | 32.17% | 29.35% | | | | | | |
| Aug | 100.00% | 39.56% | 32.38% | 30.67% | 27.98% | | | | | | | |
| Sep | 100.00% | 34.77% | 30.26% | 27.20% | | | | | | | | |
| Oct | 100.00% | 29.41% | 24.79% | | | | | | | | | |
| Nov | 100.00% | 30.60% | | | | | | | | | | |
| Dec | 100.00% | | | | | | | | | | | |
We can then easily visualize the retention trend across the different cohorts with a line chart. Ideally, retention rates should improve over time, although that is not the case here.
```python
fig, ax = plt.subplots(1, 1)
retention.T.plot(marker='s', ax=ax, color=colors, alpha=0.67)

ax.set_title('Hanging Ribbons', loc='left')
ax.set_ylabel('% Of Cohort Retained')
ax.set_xlabel('Time')

plt.gca().yaxis.set_major_formatter(PercentFormatter(xmax=1))

ax.legend(
    title='Cohort',
    loc='upper center',
    ncol=5,
    frameon=True
)
plt.tight_layout()
plt.show()
```
We can easily check that retention is negatively correlated with time (and with the size of the cohorts).
```python
from sklearn.preprocessing import minmax_scale

scaled_nc_cohorts_left = minmax_scale(range(1, len(nc_cohorts_left.index)+1), axis=0)
scaled_retention = minmax_scale(retention.iloc[:, 1:4], axis=0)

df_scaled = pd.DataFrame(
    scaled_nc_cohorts_left, columns=["nc_cohorts_left"]
).join(pd.DataFrame(scaled_retention, columns=["retention_1", "retention_2", "retention_3"]))

corr_matrix = df_scaled.corr()

sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", fmt=".2f", cbar=False)
plt.title("Correlation Matrix")
plt.tight_layout()
```
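The heatmap correlates retention with cohort order, i.e. with time. To also check the parenthetical claim about cohort size directly (my addition, using only quantities already computed above):

```python
# Pearson correlation between month-1 retention and initial cohort size
size_corr = retention[1].corr(nc_cohorts_left[0])
print(f"Corr(retention at month 1, cohort size): {size_corr:.2f}")
```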
### Assessing churn rates
With a left-aligned table, we can also analyse churn rates for every cohort. Churn rates in the following table are calculated as: \[ \frac{Churned\, users\, at\, month\, n}{Active\, users\, at\, month\, n-1} \]
```python
churn = nc_cohorts_left.copy()

for col in churn.columns:
    if col == 0:
        churn[col] = 0
    else:
        churn[col] = (nc_cohorts_left[col-1]-nc_cohorts_left[col])/nc_cohorts_left[col-1]

styled_df = churn.style.background_gradient(cmap=cmap, axis=1, vmax=.2, vmin=0).format('{:.2%}')

styled_df = styled_df.apply(
    lambda row: ['background-color: white; color: white' if pd.isna(v) else '' for v in row]
)
styled_df
```
| month | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Jan | 0.00% | 39.86% | 8.43% | 9.87% | -1.46% | -4.32% | 4.83% | -1.45% | 0.71% | 2.16% | -0.74% | 1.46% |
| Feb | 0.00% | 45.67% | 13.38% | 2.56% | 6.13% | 1.76% | 2.00% | 9.04% | -2.72% | 5.30% | 1.22% | |
| Mar | 0.00% | 44.93% | 10.56% | 4.10% | 6.98% | 1.42% | 7.56% | 2.95% | 3.65% | 5.95% | | |
| Apr | 0.00% | 47.71% | 8.73% | 7.60% | 5.72% | 5.81% | 5.70% | 1.72% | 8.99% | | | |
| May | 0.00% | 52.50% | 12.74% | 6.29% | 6.31% | 5.45% | 1.16% | 8.55% | | | | |
| Jun | 0.00% | 54.40% | 12.62% | 9.40% | 7.79% | 0.96% | 9.92% | | | | | |
| Jul | 0.00% | 55.66% | 15.46% | 11.31% | 3.24% | 8.77% | | | | | | |
| Aug | 0.00% | 60.44% | 18.15% | 5.28% | 8.77% | | | | | | | |
| Sep | 0.00% | 65.23% | 12.99% | 10.10% | | | | | | | | |
| Oct | 0.00% | 70.59% | 15.69% | | | | | | | | | |
| Nov | 0.00% | 69.40% | | | | | | | | | | |
| Dec | 0.00% | | | | | | | | | | | |
It’s interesting to calculate and plot the weighted average churn rate for every month. In this case, most of the users disengage after the first month: they were attracted by offers and promotions, and left the platform as soon as they had used those incentives.
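Concretely, "weighted average" here means pooling users across cohorts before taking the ratio, which weights each cohort's churn rate by its size (this restates what the code below computes): \[ \frac{\sum_{cohorts} Churned\, users\, at\, month\, n}{\sum_{cohorts} Active\, users\, at\, month\, n-1} \]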
```python
churned = nc_cohorts_left.copy()
retained = nc_cohorts_left.copy()

for col in churned.columns:
    if col == 0:
        churned[col] = 0
    else:
        churned[col] = (nc_cohorts_left[col-1]-nc_cohorts_left[col])

avg_churn = np.zeros(11)

# Exclude the last cohort (December), which only has its initial month
retained_sum = retained.iloc[:-1,:].sum()
churned_sum = churned.iloc[:-1,1:].sum()

for i in range(len(avg_churn)):
    avg_churn[i] = churned_sum[i+1] / retained_sum[(i)]

avg_churn
```
```
array([0.61894928, 0.12467046, 0.0687431 , 0.05115784, 0.03639244,
       0.04167227, 0.03471718, 0.0376834 , 0.02842538, 0.00237718,
       0.00210748])
```
```python
std = churn.iloc[:-1,1:].std()

fig, ax = plt.subplots(1, 1, figsize=(8,5))
ax.plot(range(1, 12, 1), avg_churn, label='Mean', color='#00A082')
plt.fill_between(range(1, 12, 1), avg_churn - std, avg_churn + std, alpha=0.2, label='±1 Std Dev', color='#00A082')

plt.xlabel('Month')
plt.ylabel('% of churned customers (relative to prev. month)')
plt.title('Churned customers in lifetime month', loc='left')
plt.gca().yaxis.set_major_formatter(PercentFormatter(xmax=1))
plt.legend()
plt.tight_layout()
```
I then wanted to see how many additional clients could have been retained at the end of the year by reducing the first-month churn rate by 5 percentage points.
```python
avg_churn_proj = avg_churn.copy()
avg_churn_proj[0] = avg_churn_proj[0]-.05  # cut first-month churn by 5 percentage points

starting_users = nc_cohorts_left[0].values
projection = np.zeros([12,12])
projection[:,0] = starting_users
# Roll each cohort forward through the projected monthly churn rates
for i in range(11):
    projection[:(11-i),i+1] = projection[:(11-i),i] * (1-avg_churn_proj[i])

# Sum the anti-diagonal: each cohort's projected users in December
total_clients = 0
for i in range(len(projection)):
    total_clients += projection[i, 11-i]

print(f"Additional clients: {int(total_clients - retained_clients)}")
```

```
Additional clients: 42441
```
To gauge how long clients who get past the first month stay engaged with the service, it is useful to calculate the (truncated) expected lifespan of the users. This is simply calculated as: \[ \frac{1}{avg\, churn\, rate} \]
```python
# Expected lifespan (in months) for users who survive month 1:
# reciprocal of their pooled average monthly churn rate
1/((nc_cohorts_left.sum()[1:-1]*avg_churn[1:]).sum()/nc_cohorts_left.sum()[1:-1].sum())
```

```
14.111089070031147
```
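For intuition on why lifespan is the reciprocal of the churn rate (my addition): with a constant monthly churn probability \(p\), the number of months a user stays is geometrically distributed, with mean \[ E[lifetime] = \sum_{n=1}^{\infty} n\, p\, (1-p)^{n-1} = \frac{1}{p} \] which, applied to the pooled churn rate of users who survive month 1, gives the roughly 14-month figure above.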